On the Reliable Identification of Plant Sequences Containing a Polyadenylation Site
Abstract
It is challenging to predict with high reliability whether a plant genomic sequence contains a polyadenylation (polyA) site. In this paper, we solve this task with a systematic machine-learning procedure applied to a dataset of 1000 Arabidopsis thaliana sequences flanking polyA sites. Our procedure consists of three steps. In the first step, we extract informative features from the sequences using the highly informative k-mer windows approach. Experiments with five classifiers show that the best performance is approximately 83%. In the second step, we improve performance to 95% by reducing the number of features with linear discriminant analysis and then applying the linear discriminant classifier. In the third step, we apply the transductive confidence machines approach and the receiver operating characteristic isometrics approach. The resulting two classifiers make it possible to preset any desired performance by leaving unclassified those sequences for which it is unclear whether they contain a polyA site. For example, in our case study, we obtain 99% performance by leaving 26% of the sequences unclassified, and 100% performance by leaving 40% of the sequences unclassified. This is clearly useful for experimental verification of putative polyA sites in the laboratory. The novel methods in our machine-learning procedure should find applications in several areas of bioinformatics.
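The first and third steps can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: the k-mer windows features are taken here to be k-mer counts inside consecutive, non-overlapping windows, and the option to leave sequences unclassified is shown as a plain score band, which is a simpler stand-in for the transductive confidence machines and ROC isometrics approaches, not an implementation of them. The function names, the window size, and the thresholds are all hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_window_features(seq, k=2, window=10):
    """Count each k-mer separately inside consecutive, non-overlapping windows.

    Assumed reading of "k-mer windows"; the paper's exact definition may differ.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    features = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        counts = Counter(chunk[i:i + k] for i in range(len(chunk) - k + 1))
        # One feature per (window, k-mer) pair; absent k-mers count as 0.
        features.extend(counts[m] for m in kmers)
    return features


def classify_with_reject(score, low, high):
    """Return 1 or 0 when the score is outside [low, high), else None."""
    if score >= high:
        return 1
    if score < low:
        return 0
    return None  # unclear case: leave the sequence unclassified


def accuracy_and_rejection(scores, labels, low, high):
    """Accuracy on the classified sequences and the fraction left unclassified."""
    decisions = [classify_with_reject(s, low, high) for s in scores]
    kept = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    rejected = 1 - len(kept) / len(labels)
    accuracy = sum(d == y for d, y in kept) / len(kept) if kept else float("nan")
    return accuracy, rejected
```

Widening the band between `low` and `high` leaves more sequences unclassified and typically raises the accuracy on the rest, which is the kind of trade-off the abstract reports (for example, 99% performance with 26% of the sequences unclassified).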
Similar resources
Different 3' end regions strongly influence the level of gene expression in plant cells.
We have investigated the functional role of a 3' end region on the expression of a reporter gene in plant cells. In stably transformed plants, expression of the reporter gene without a plant gene 3' end is variable and depends on the fortuitous presence of polyadenylation signals in the downstream sequences. When the reporter gene is flanked by pBR322 DNA, 3'-processing and polyadenylation occu...
P-215: Discovery of A Novel APA Variant of A Human Potential Gene Based on Expressed Sequenced Tags Analysis
Background: Expressed sequence tags (ESTs) are sequences of cDNA fragments prepared from different tissue sources. There are over one million of these sequences in the publicly available database, and these sequences are believed to represent more than half of all human genes. The ESTs belong to different cDNA libraries, each prepared from one particular cell type, organ, or tumor. Therefore, th...
Upstream sequences other than AAUAAA are required for efficient messenger RNA 3'-end formation in plants.
We have characterized the upstream nucleotide sequences involved in mRNA 3'-end formation in the 3' regions of the cauliflower mosaic virus (CaMV) 19S/35S transcription unit and a pea gene encoding ribulose-1,5-bisphosphate carboxylase small subunit (rbcS). Sequences between 57 bases and 181 bases upstream from the CaMV polyadenylation site were required for efficient polyadenylation at this si...
Molecular identification of Dunaliella viridis Teod. strain MSV-1 utilizing rDNA ITS sequences and its growth responses to salinity and copper toxicity
In addition to biochemical, physiological and morphological analysis, molecular studies provide additional information for establishing phylogenetic relationships among different species and strains of the genus Dunaliella. In the present study, based on neighbor-joining analysis of the nuclear rDNA ITS sequence, a novel strain of the green alga Dunaliella viridis was identified from Maharlu ...
Identification and characterization of a NBS–LRR class resistance gene analog in Pistacia atlantica subsp. Kurdica
P. atlantica subsp. Kurdica, with the local name of Baneh, is a wild medicinal plant which grows in Kurdistan, Iran. The identification of resistance gene analogs holds great promise for the development of resistant cultivars. A PCR approach with degenerate primers designed according to conserved NBS-LRR (nucleotide binding site-leucine rich repeat) regions of known disease-resistance (R) gene...
Journal:
- Journal of computational biology : a journal of computational molecular cell biology
Volume 14, Issue 9
Publication date: 2007